SUMMARY


Data analyses, to be comprehensive and have low risk of missing critical information in the
data, need to be deep. One property of deep analysis is that the data are, in part, analyzed in
detail at their finest granularity. Analyzing summary statistics is also important. But analyzing
just summary statistics, sometimes done to enable computation, presents a high risk of missing
critical information. Deep analysis of big data, however, involves analytic methods of high
computational complexity. A new approach to data analysis is required that also provides for
feasible, practical computation, which in turn means analytic methods must be computed in
parallel. Divide and Recombine (D&R) provides this. The DeltaRho software enables data analysts
to carry out D&R in practice without having to manage the details of parallel computation, so
that they can spend a large majority of their time thinking about the data and its analysis and a
minority of their time programming the analysis. The analyst programs in R, the most used system
for data analysis. GitHub is the development site for the DeltaRho software. The web site
www.deltarho.org provides documentation for DeltaRho and instructions for downloading it.

Success is gauged by the analysis of big datasets with methods of high computational
complexity, and it has been demonstrated by our own analyses. Two examples are:
10,615,054,608 queries to the Spamhaus Internet Protocol (IP) address blocklisting service; and
50,632 3-hour satellite rain-rate measurements at 576,000 locations from the Tropical Rainfall
Measuring Mission. We have run Computational Performance Measurement and Analysis (CPM&A)
designed experiments that show fast performance. On a cluster with 10 nodes, 264 cores, 528
gigabytes (GB) of random access memory (RAM), and 88 terabytes (TB) of disk space, we ran a
logistic regression with 1 TB of data and 1 million subsets. We measured two elapsed times: the
time to read subsets into memory and form the subsets, and the time to carry out the computation
of the logistic regressions of subsets and the recombination to get a single result. The two times
were 12.1 minutes and 6.0 minutes, respectively, a total of 18.1 minutes. That is a practical
amount of time for a data analyst to wait for a 1 TB dataset.
INTRODUCTION


In  Divide  and  Recombine  (D&R),  the  analyst  divides  the  data  into  subsets  by  a  D&R  division 
method.  Each  analytic  method  is  applied  to  each  subset,  independently,  without  communication. 
The  outputs  of  each  analytic  method  are  recombined  by  a  D&R  recombination  method. 
Sometimes  the  goal  is  one  result  for  all  of  the  data,  such  as  a  logistic  regression;  D&R  theory 
and  methods  seek  division  and  recombination  methods  to  optimize  the  statistical  accuracy. 

Much  more  common  in  practice  is  a  division  based  on  the  subject  matter.  The  data  are  divided 
by conditioning on variables important to the analysis. In this case the outputs can be the final
result, or they can be analyzed further, which is an analytic recombination.

D&R  computation  is  mostly  embarrassingly  parallel,  the  simplest  parallel  computation. 
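
As a minimal sketch of what embarrassingly parallel means here, the Python below applies one analytic function to each subset independently and only afterwards recombines the outputs. The names (subsets, analyze, recombine), the toy data, and the use of a thread pool are illustrative assumptions; this is not the DeltaRho interface, which is programmed in R.

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import mean

# Hypothetical toy data: each key is one subset produced by a division.
subsets = {
    "a": [1.0, 2.0, 3.0],
    "b": [10.0, 20.0],
    "c": [5.0, 5.0, 5.0, 5.0],
}

def analyze(values):
    """Analytic method applied to one subset; it sees no other subset."""
    return {"n": len(values), "mean": mean(values)}

def recombine(outputs):
    """Recombination: pool per-subset outputs into one all-data mean."""
    total = sum(o["n"] for o in outputs.values())
    return sum(o["n"] * o["mean"] for o in outputs.values()) / total

# Each call to analyze() is independent, so the calls can run in any order
# or at the same time; no worker communicates with another.
with ThreadPoolExecutor(max_workers=3) as pool:
    results = dict(zip(subsets, pool.map(analyze, subsets.values())))

overall_mean = recombine(results)
```

Because no subset's computation depends on another's, the same pattern carries over unchanged from threads on one machine to processes spread across a cluster.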

DeltaRho  software  is  an  open-source  implementation  of  D&R.  (See  www.deltarho.org.)  The 
front  end  is  the  R  package  datadr,  which  is  a  language  for  D&R.  It  makes  programming  D&R 
simple.  At  the  back  end,  running  on  a  cluster,  is  a  distributed  database  and  parallel  compute 
engine  such  as  Hadoop,  which  spreads  subsets  and  outputs  across  the  cluster,  and  executes  the 
analyst's R and datadr code in parallel. The R package R and Hadoop Integrated Programming
Environment  (RHIPE)  provides  communication  between  datadr  and  Hadoop.  DeltaRho  protects 
the  analyst  from  management  of  parallel  computation  and  database  management. 

It is natural to divide data based on the subject matter. The data are divided by conditioning on
variables important to the analysis. For example, in a collaboration between faculty and students
in Purdue's Statistics Department, and Wen-wen Tung in the Earth, Atmospheric, and Planetary
Sciences Department and her students, we are analyzing a satellite dataset from the Tropical
Rainfall Measuring Mission. There are 50,632 3-hour rainfall measurements at each of 576,000
locations. One division is by location across time, so there are 576,000 subsets with 50,632
measurements per subset. Another is by time across locations, so there are 50,632 subsets with
576,000 measurements per subset. Subject-matter division is just as valid for small datasets. It has
been widely practiced in the past and is a statistical best practice. For D&R we use such division
both as a best practice and for computational gain. Subject-matter division is the most used in
practice.
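
The two divisions of the rainfall data can be sketched on a toy grid. In the Python below, the record layout (location, time, rain rate), the synthetic values, and the helper name divide are assumptions made for illustration; only the counts 576,000 and 50,632 come from the text.

```python
from collections import defaultdict

# Hypothetical toy records: (location_id, time_index, rain_rate).
records = [
    (loc, t, 0.1 * loc + 0.01 * t)
    for loc in range(4)      # stands in for the 576,000 locations
    for t in range(6)        # stands in for the 50,632 time points
]

def divide(rows, key_index):
    """Divide rows into subsets by conditioning on one field."""
    subsets = defaultdict(list)
    for row in rows:
        subsets[row[key_index]].append(row)
    return dict(subsets)

by_location = divide(records, key_index=0)  # one time series per location
by_time = divide(records, key_index=1)      # one spatial field per time

# Sizes for the full dataset described in the text.
n_locations, n_times = 576_000, 50_632
total_measurements = n_locations * n_times
```

Either division covers the same measurements, about 29 billion for the full dataset; what changes is which variable is held fixed within a subset.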

In  sampling  division  each  subset  is  seen  as  a  sample  of  the  data.  Subsets  are  replicate  samples. 
For  example,  we  can  carry  out  random  replicate  division:  choose  subsets  randomly.  We  seek  a 
single  result  for  all  of  the  data.  There  is  a  statistical  division  method  and  a  statistical 
recombination method. The statistical accuracy of the D&R result is typically less than that of the
direct all-data result. D&R research in statistical theory seeks to maximize the statistical
accuracy of D&R results. The accuracy depends on the division method and the recombination
method. A community of researchers in this area is developing, extending beyond our D&R team.
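
A small sketch may help make random replicate division concrete. The Python below uses a least-squares slope as a stand-in for the analytic method and a plain average as the recombination method; both choices, the simulated data, and all names are assumptions for illustration, not the division or recombination methods studied by the D&R team.

```python
import random

random.seed(0)

# Hypothetical data: y = 2x + 1 + noise, with 10,000 observations.
xs = [random.uniform(-3.0, 3.0) for _ in range(10_000)]
data = [(x, 2.0 * x + 1.0 + random.gauss(0.0, 1.0)) for x in xs]

def ols_slope(rows):
    """Least-squares slope of y on x for one set of (x, y) rows."""
    n = len(rows)
    mx = sum(x for x, _ in rows) / n
    my = sum(y for _, y in rows) / n
    sxy = sum((x - mx) * (y - my) for x, y in rows)
    sxx = sum((x - mx) ** 2 for x, _ in rows)
    return sxy / sxx

# Division: shuffle, then cut into 100 equal-size replicate subsets.
shuffled = data[:]
random.shuffle(shuffled)
size = 100
subsets = [shuffled[i:i + size] for i in range(0, len(shuffled), size)]

# Apply the method to each subset, then recombine by averaging.
subset_slopes = [ols_slope(s) for s in subsets]
dr_slope = sum(subset_slopes) / len(subset_slopes)

# Direct all-data estimate, for comparison.
all_data_slope = ols_slope(data)
```

Comparing dr_slope with all_data_slope shows the kind of gap in accuracy that research on division and recombination methods aims to shrink.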

Computational performance of our software is a critical matter. However, today, Computational
Performance Measurement and Analysis (CPM&A) for big data and high computational
complexity is often lacking in rigor and not sufficiently informative. Performance tests typically
use a few low-level computations such as sort, which are not informative for data analyses. We
study response times of analytic methods, which are what an analyst directly uses and wants
to run as fast as possible. Benchmark testing today often fails to control for factors that are
important for comparing aspects of two systems. For example, Hadoop configuration parameters
can have a big impact. Pilot experiments show there are strong interactions among factors, yet
little attention today is given to interactions. We have been addressing these problems with
designed experiments involving many factors.

Experimentation  is  challenging.  Knowledge  of  big-data  systems  is  needed  to  determine  salient 
factors.  Interactions  are  not  readily  quantifiable  from  the  knowledge.  So  empirical  model 
building  is  necessary.  Running  a  design  on  a  cluster  can  be  complex.  Some  factors  have  levels 
that can be changed only by system administrators. One current topic here is comparing the
Spark and Hadoop back ends; to keep factors fixed for each system, both must be installed on a
cluster. Attention must be paid to limiting all processes other than those of the operating system.
Care must be taken to ensure replicate runs are independent and not affected by aspects of the
systems that can create trends across runs. Many matters need to be evaluated by running
on more than one cluster.
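
One way to lay out such a multifactor experiment is a replicated full factorial design with a randomized run order. The Python below is a sketch under stated assumptions: the factor names and levels are hypothetical placeholders, not the factors used in the CPM&A experiments.

```python
import itertools
import random

# Hypothetical factors and levels; the real factors are not listed here.
factors = {
    "back_end": ["hadoop", "spark"],
    "block_size_mb": [64, 128, 256],
    "n_subsets": [10_000, 100_000],
}
n_replicates = 3

# Full factorial design: every combination of levels, replicated.
names = list(factors)
runs = [
    dict(zip(names, levels), replicate=r)
    for levels in itertools.product(*factors.values())
    for r in range(n_replicates)
]

# Randomize run order so time trends on the cluster are not confounded
# with any factor or with the replicate index.
rng = random.Random(42)
rng.shuffle(runs)
```

Crossing every level of every factor is what allows interactions to be estimated; the shuffle addresses the concern about trends across successive runs.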

One critical aspect of our work is that throughout the entire program we analyzed big datasets.
Some of the datasets were those required by the XDATA program. But many others were those
analyzed by members of the team who were subject-matter experts in certain areas. These
analyses were live. "Live" means the success of the analysis is judged by the subject matter results. Live
analyses  serve  as  an  effective  heat  engine  for  research  ideas  that  solve  real  problems,  and  serve 
as  a  test  bed  for  those  ideas.  Analyses  need  to  be  live  because  it  is  important  to  judge  research 
based  on  how  well  it  increases  subject-matter  knowledge  and  the  effort  required  to  get  the 
increases.  There  are  notions  of  statistical  accuracy  that  can  be  important,  too,  but  those  become 
subject  to  the  same  criterion  of  increasing  subject-matter  knowledge. 
METHODS, ASSUMPTIONS, AND PROCEDURES


Software  For  Divide  and  Recombine 

Much  of  our  work  in  XDATA  focused  on  building  software  prototypes  implementing  the  Divide 
&  Recombine  research.  While  in  the  XDATA  program,  we  developed  several  iterations  of 
components  of  a  software  stack  for  analysis  and  visualization  of  large  complex  data  and  used 
this  to  analyze  the  XDATA  challenge  datasets  as  well  as  our  own  research  datasets.  In  addition 
to  building  the  components,  we  also  initiated  a  project,  ultimately  named  DeltaRho,  to 
encapsulate  the  entire  body  of  work  and  build  a  community  of  users. 

The  components  developed  over  the  term  of  the  XDATA  program  are  the  R  and  Hadoop 
Integrated  Programming  Environment  (RHIPE),  the  datadr  R  package,  the  trelliscope  R  package, 
the  next  iteration  trelliscopejs  package,  and  an  R  interface  to  Anaconda,  Inc.'s  Bokeh 
visualization  library,  rbokeh.  In  the  process  of  creating  these  packages  and  performing  our 
XDATA  data  analysis  work,  several  other  related  supporting  software  components  were  created. 
The  effort  for  each  of  these  components  is  described  in  more  detail  below. 

datadr 

The  datadr  R  package  provides  a  simple  interface  to  D&R  operations.  It  allows  an  analyst  to 
operate  wholly  from  within  R  on  very  large  datasets,  using  R  code  to  compute  on  the  data. 
Instead  of  MapReduce,  which  is  not  a  logical  way  to  think  for  many  data  analysis  tasks,  datadr 
provides  an  interface  to  D&R,  a  simple  concept  very  familiar  to  statisticians,  while  in  the 
background mapping D&R tasks into the appropriate MapReduce steps. The datadr package
provides generic interfaces to several D&R workflows, as well as constructs for reading and writing
data, computing common summary statistics, statistical distributions, and other aggregations in a
division-agnostic manner.
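
To illustrate how a D&R-style call can be mapped onto MapReduce steps behind the scenes, here is a minimal single-process sketch in Python. The function names (map_phase, shuffle_phase, reduce_phase, divide_and_apply) are hypothetical and do not correspond to datadr or RHIPE functions.

```python
from collections import defaultdict

# Hypothetical input rows: (group, value).
rows = [("x", 1), ("y", 10), ("x", 3), ("y", 30), ("x", 5)]

def map_phase(rows, key_fn):
    """Map: emit (key, row) pairs; the key decides the subset."""
    for row in rows:
        yield key_fn(row), row

def shuffle_phase(pairs):
    """Shuffle: group emitted rows by key, forming the subsets."""
    groups = defaultdict(list)
    for key, row in pairs:
        groups[key].append(row)
    return groups

def reduce_phase(groups, analytic_fn):
    """Reduce: apply the analytic method to each subset independently."""
    return {key: analytic_fn(subset) for key, subset in groups.items()}

def divide_and_apply(rows, by, analytic_fn):
    """The D&R-style call an analyst writes; MapReduce stays hidden."""
    return reduce_phase(shuffle_phase(map_phase(rows, by)), analytic_fn)

subset_sums = divide_and_apply(
    rows,
    by=lambda row: row[0],
    analytic_fn=lambda subset: sum(v for _, v in subset),
)
```

The analyst supplies only the conditioning variable and the per-subset function; the map, shuffle, and reduce steps are an implementation detail.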

Our  initial  version  of  datadr  required  logic  for  plugging  in  different  back  ends  such  as  Hadoop, 
local  disk,  or  local  memory,  to  be  written  from  the  ground  up  for  each  case,  which  made  datadr 
less extensible and more difficult to maintain. In a second iteration, we created a
back-end-agnostic interface, making it much more straightforward for datadr to harness new
technology (such as Spark) as it comes along. Regardless of the back end, coding is done entirely
in R and data are represented as R objects. datadr currently supports in-memory, local disk /
multicore, and Hadoop back ends.
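
The idea of a back-end-agnostic interface can be sketched as follows. This Python example is an assumption-laden illustration: the Backend class, its put and items methods, and the JSON-on-disk storage are invented for the sketch and are not how datadr is implemented.

```python
import json
import os
import tempfile
from abc import ABC, abstractmethod

class Backend(ABC):
    """Hypothetical storage interface: key-value pairs of subsets."""

    @abstractmethod
    def put(self, key, value): ...

    @abstractmethod
    def items(self): ...

class MemoryBackend(Backend):
    def __init__(self):
        self._store = {}

    def put(self, key, value):
        self._store[key] = value

    def items(self):
        return iter(self._store.items())

class DiskBackend(Backend):
    def __init__(self, directory):
        self._dir = directory

    def put(self, key, value):
        with open(os.path.join(self._dir, f"{key}.json"), "w") as f:
            json.dump(value, f)

    def items(self):
        for name in sorted(os.listdir(self._dir)):
            with open(os.path.join(self._dir, name)) as f:
                yield name[:-len(".json")], json.load(f)

def subset_means(backend):
    """Analysis code written once, against the interface only."""
    return {k: sum(v) / len(v) for k, v in backend.items()}

data = {"a": [1, 2, 3], "b": [4, 6]}
mem = MemoryBackend()
disk = DiskBackend(tempfile.mkdtemp())
for backend in (mem, disk):
    for key, values in data.items():
        backend.put(key, values)

mem_result = subset_means(mem)
disk_result = subset_means(disk)
```

Supporting a new back end then means writing one more class that satisfies the interface, with no change to the analysis code.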

We  worked  on  experimental  support  for  Apache  Spark,  demonstrating  the  back  end  agnostic 
design,  but  due  to  limitations  in  the  R/Spark  connector  package,  a  solution  ready  for  practical  use 
was  not  reached. 
RHIPE


RHIPE  is  the  R  and  Hadoop  Integrated  Programming  Environment.  RHIPE  allows  an  analyst  to 
run  Hadoop  MapReduce  jobs  wholly  from  within  R.  RHIPE  is  used  by  datadr  when  the  back  end 
for  datadr  is  Hadoop.  You  can  also  perform  D&R  operations  directly  through  RHIPE,  although 
in this case you are programming at a lower level. RHIPE existed before the XDATA program,
but throughout the course of XDATA it underwent many iterations and enhancements, including
migration to Hadoop 2.0, support for several of the most popular Hadoop distributions, an
improved build system, more and better documentation, and a large number of bug fixes. Work
on  fine-tuning  Hadoop  parameters  with  RHIPE  was  carried  out  through  scaling  tests,  where  we 
assessed  the  scalability  of  RHIPE  by  analyzing  up  to  128  terabytes  of  data  on  one  of  our 
institutional  clusters. 

trelliscope 

Trelliscope  is  a  D&R  visualization  tool  based  on  Trellis  Display  that  enables  scalable,  flexible, 
detailed  visualization  of  data.  Trelliscope,  backed  by  datadr,  scales  Trellis  Display,  allowing  the 
analyst  to  break  potentially  very  large  data  sets  into  many  subsets,  apply  a  visualization  method 
to  each  subset,  and  then  interactively  sample,  sort,  and  filter  the  panels  of  the  display  on  various 
quantities  of  interest.  We  built  the  trelliscope  R  package  as  a  Shiny  application  that  displays  plots 
of  divisions  of  a  distributed  data  object,  potentially  backed  by  very  large  datasets  sitting  on 
Hadoop.  In  one  of  the  summer  challenges,  we  illustrated  the  scalability  of  this  system  by  creating 
a  Trelliscope  display  of  NxCore  data  using  over  a  million  subsets  representing  over  50  billion 
data  points. 
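
The sort-and-filter interaction over panels can be sketched without any plotting. In the Python below, each subset gets a few summary quantities, and panels are then filtered and ordered by them; the subset keys, the chosen quantities, and the threshold are hypothetical, and the sketch does not use the trelliscope package.

```python
from statistics import mean

# Hypothetical subsets: one short series per panel key.
subsets = {
    "AAA": [1.0, 1.2, 0.9, 1.1],
    "BBB": [5.0, 9.0, 2.0, 7.0],
    "CCC": [3.0, 3.1, 3.0, 2.9],
    "DDD": [0.5, 4.5, 0.2, 6.0],
}

def panel_quantities(values):
    """Per-panel quantities an analyst might sort or filter on."""
    return {"mean": mean(values), "range": max(values) - min(values)}

# One row of quantities per panel; the plot itself would be drawn lazily.
panels = [{"key": k, **panel_quantities(v)} for k, v in subsets.items()]

# Filter to panels with a wide range, then sort by mean, descending.
selected = sorted(
    (p for p in panels if p["range"] > 1.0),
    key=lambda p: p["mean"],
    reverse=True,
)
selected_keys = [p["key"] for p in selected]
```

With a million subsets, only the small table of quantities needs to be searched to decide which handful of panels to render.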

trelliscopejs 

The trelliscope R package provides a web-based viewer written as a Shiny application. This
approach has several limitations. One is that a Shiny server is needed to deploy the
visualizations. Another major limitation is that Shiny is not well suited to very
complex UIs that require quick iterative interactivity within the browser. The application became
unwieldy to maintain, and new features were very difficult to add because the internals were so
complicated. Because of these limitations, we rewrote Trelliscope from the ground up as a
JavaScript viewing engine using the React framework, after a lot of early experimentation
with other JavaScript frameworks such as Ember and Angular. The trelliscopejs JavaScript
library provides a pure-JavaScript web-based Trelliscope viewer that could in principle be
instantiated with data from any source. In our work, since we use R, we created an R interface to
create and populate trelliscopejs displays. This R interface provides two approaches that fit
seamlessly into a typical R user's workflow. One is the ability to easily create a Trelliscope
display from an existing ggplot2 object. The other approach fits into the now-popular "tidyverse"
ecosystem in R and allows users to build data frames of plots as "list-columns" of the data frame
and easily generate a display from the data frame.


APPROVED  FOR  PUBLIC  RELEASE;  DISTRIBUTION  UNLIMITED 


rbokeh 


Toward  the  end  of  the  XDATA  program,  we  determined  that  the  R  world  needed  to  have  access 
to  all  the  work  going  on  in  the  Bokeh  visualization  library  developed  by  Anaconda,  Inc.  We  built 
an  R  interface  to  Bokeh,  providing  many  of  the  features  R  users  expect  in  a  visualization  system, 
drawing  a  lot  of  inspiration  from  ggplot2  in  terms  of  providing  a  layered  plot  specification 
mechanism  that  provides  features  like  mapping  aesthetics,  automatic  legends,  etc.  This  work 
allowed us to specify interactive graphics with the same ease as creating static plots with
ggplot2.  Much  of  the  interactivity  we  get  for  free  thanks  to  Bokeh,  such  as  pan,  zoom,  and 
tooltips.  More  customized  interactions  were  facilitated  through  callback  support.  We  also  built  in 
support  for  rbokeh  graphics  to  interact  within  Shiny  applications. 

graphqlr 

While  working  on  trelliscopejs,  we  anticipated  a  future  where  trelliscopejs  would  be  able  to  more 
efficiently query the data it needs from R using Relay/GraphQL. GraphQL is a back-end-agnostic
data query language and runtime that drastically reduces the number of server requests created
by the browser by using a dynamic and nested query structure. To prepare for this, we
developed the "gqlr" package. This package draws inspiration from Facebook's graphql-js
package  and  implements  a  full  GraphQL  server  within  R.  "gqlr"  allows  R  users  to  supply  their 
own  functions  to  satisfy  the  data  requirements  of  a  submitted  GraphQL  query,  thus  enjoying  the 
rapid  iteration  time  of  R  and  production  iteration  time  of  GraphQL. 

packagedocs 

To  help  create  a  unified  documentation  website  for  all  the  packages  in  the  DeltaRho  stack,  we 
developed  the  packagedocs  R  package,  which  automatically  generates  a  web  page  of 
documentation  and  function  reference  pages  based  on  R  Markdown  tutorial  text  and  the  R  docs 
files.  We  used  this  package  to  build  the  documentation  for  the  R  package  components  of 
DeltaRho  and  served  them  with  our  website. 

stlplus 

The "stlplus" package was used extensively in many of the time series analyses we performed on
XDATA challenge datasets as well as our own large research datasets. This package existed
prior to XDATA as "stl2", but we migrated the code base to a revamped "stlplus" package. We
ported the old C code to C++ using Rcpp as the integration mechanism. We revamped the
documentation, got the package CRAN-ready, and published it on the Comprehensive R
Archive Network (CRAN).

rmote 

To  help  deal  with  a  common  issue  of  working  on  a  remote  cluster  head  node  from  a  local 
machine,  but  wanting  to  receive  graphical  output  from  that  node  while  working  within  R,  we 
built a package, "rmote". Traditional approaches to viewing graphics while working
on remote machines include X11 forwarding and using Virtual Network Computing (VNC)
with a Linux desktop environment. Neither is a very attractive solution, and both have
major limitations.
We built rmote to make working in R over Secure Shell (SSH) on a server a bit more
pleasant  in  terms  of  viewing  output.  It  opens  a  live  connection  from  the  remote  machine  to  the 
local  machine  through  websockets,  opening  a  live-updating  web  page  on  the  local  machine  that 
updates  whenever  a  new  output  is  produced  from  R.  This  setup  allows  a  rich  set  of  graphics  to  be 
passed  back  to  the  local  machine,  including  web-based  graphics,  Shiny  apps,  etc. 

devops 

While  our  software  work  endeavored  to  make  programming  with  big  data  as  simple  as  possible, 
systems setup was always a major pain point. A great deal of development effort and support
was focused on simpler instantiation of Hadoop clusters. We created Vagrant images
and Dockerfiles for several different cluster scenarios supporting the DeltaRho stack. We also set
up  scripts  to  help  easily  provision  a  full-featured  DeltaRho  /  Hadoop  cluster  on  Amazon  Web 
Services  (AWS),  and  packaged  the  script  up  into  a  one-line  setup  tool. 

Tessera  /  DeltaRho 

A  great  deal  of  time  and  effort  went  into  community  building  for  our  project.  This  began  with 
the  Tessera  project,  for  which  we  created  a  website,  tessera.io,  which  provided  an  introduction, 
installation  and  usage  documents,  the  documentation  for  all  the  packages,  and  links  to  research. 
This  project  was  superseded  by  the  DeltaRho  project,  now  hosted  at  deltarho.org. 

Future  /  Impact 

Our  work  on  D&R  research  and  software  implementation  has  had  a  significant  impact  on  the 
direction of statistical computing work for big data. Through the many talks and tutorials that we
gave on the approach and the software over the course of the XDATA program, the ideas and
software have been adopted by several researchers, and additional ideas and enhancements have
been put forward by some of these groups. The datadr package provided the first distributed data
object / distributed data frame representation in R, which inspired further work in this area,
including the "ddr" package developed out of a partnership that began at a big data workshop
that we participated in at HP Labs. Beyond this influence, our work on datadr and our general
approach  to  big  data  has  influenced  work  by  RStudio  on  R/Spark.  RStudio's  approach  is  based 
on  the  "tidyverse"  ecosystem,  which  early  on  was  not  compatible  with  the  D&R  approach. 

A talk and discussions at the annual "Directions in Statistical Computing" workshop, which is
organized by the R Core Team, highlighted the importance of RStudio supporting arbitrary data
structures and arbitrary R code execution in Spark, two components necessary for a D&R
system. This work is still young, but it is a promising future for big data in R and for D&R
methods in R, with an open solution that is based on more modern technology (Spark) and
supported by a more commercial and sustainable entity (RStudio). We hope that future D&R
work  can  build  upon  this  infrastructure.  The  trelliscopejs  package  has  proven  to  be  very  useful 
independent  of  the  rest  of  the  DeltaRho  ecosystem. 
Scaling Interactive Visualization to Large Data Volumes

In  the  first  phase  of  our  XDATA  award,  we  investigated  methods  for  scaling  interactive 
visualization  techniques.  Data  analysts  must  make  sense  of  increasingly  large  data  sets, 
sometimes  with  billions  or  more  records.  We  developed  methods  for  interactive  visualization 
of  big  data,  following  the  principle  that  perceptual  and  interactive  scalability  should  be  limited 
by  the  chosen  resolution  of  the  visualized  data,  not  the  number  of  records.  We  mapped  out  a 
design  space  of  scalable  visual  summaries  that  use  data  reduction  methods  (such  as  binned 
aggregation  or  sampling)  to  visualize  a  variety  of  data  types.  We  then  contributed  methods  for 
interactive  querying  (e.g.,  brushing  &  linking)  among  binned  plots  through  a  combination 
of  multivariate  data  tiles  and  parallel  query  processing.  We  implemented  our  techniques  in 
imMens,  a  browser-based  visual  analysis  system  that  uses  WebGL  for  both  data  processing  and 
rendering on the graphics processing unit (GPU). In benchmarks, imMens sustains 50
frames-per-second brushing & linking among dozens of visualizations, with invariant performance on
data  sizes  ranging  from  thousands  to  billions  of  records.  This  work  appeared  at  EuroVis  2013. 
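
The principle that scalability should depend on the chosen resolution rather than the number of records can be illustrated with binned aggregation. The Python below is a plain CPU sketch with assumed names and synthetic data; it does not reproduce the multivariate data tiles or WebGL processing used in imMens.

```python
import random

def bin_2d(points, x_range, y_range, n_bins):
    """Count points in an n_bins x n_bins grid over the given ranges."""
    (x0, x1), (y0, y1) = x_range, y_range
    grid = [[0] * n_bins for _ in range(n_bins)]
    for x, y in points:
        # Map each coordinate to a bin index, clamping the upper edge.
        i = min(int((x - x0) / (x1 - x0) * n_bins), n_bins - 1)
        j = min(int((y - y0) / (y1 - y0) * n_bins), n_bins - 1)
        grid[j][i] += 1
    return grid

rng = random.Random(1)
small = [(rng.random(), rng.random()) for _ in range(1_000)]
large = [(rng.random(), rng.random()) for _ in range(100_000)]

grid_small = bin_2d(small, (0.0, 1.0), (0.0, 1.0), n_bins=16)
grid_large = bin_2d(large, (0.0, 1.0), (0.0, 1.0), n_bins=16)

# The number of cells to draw depends only on the chosen resolution.
cells_small = sum(len(row) for row in grid_small)
cells_large = sum(len(row) for row in grid_large)
```

Once the data are binned, rendering and brushing operate on the fixed grid of cells, so their cost no longer grows with the record count.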

To  support  effective  exploration,  it  is  often  stated  that  interactive  visualizations  should  provide 
rapid  response  times.  Indeed,  this  was  the  primary  motivation  for  imMens.  However,  we  found 
that  the  effects  of  interactive  latency  on  the  process  and  outcomes  of  exploratory  visual  analysis 
had not been systematically studied. In an experiment published at the Institute of Electrical
and Electronics Engineers (IEEE) InfoVis 2014 symposium, we measured user behavior and
knowledge discovery with interactive visualizations under varying latency conditions. We
observed that an additional delay of 500 milliseconds (ms) incurs significant costs, decreasing
user  activity  and  data  set  coverage.  Analyzing  verbal  data  from  think-aloud  protocols,  we  found 
that  increased  latency  reduced  the  rate  at  which  users  made  observations,  drew  generalizations 
and  generated  hypotheses.  Moreover,  we  noted  interaction  effects  in  which  initial  exposure  to 
higher  latencies  led  to  subsequently  reduced  performance  in  a  low-latency  setting.  Overall, 
increased  latency  appears  to  cause  users  to  shift  exploration  strategy,  in  turn  affecting 
performance.  Since  publication  this  study  has  been  highly  cited,  and  used  to  help  guide  and 
motivate  work  on  a  number  of  scalable  visualization  and  low-latency  data  processing  systems. 

Declarative Languages for Interactive Visualization: The Reactive Vega Stack

Another  thread  of  XDATA  research  concerns  systems  and  tools  for  interactive  data 
visualization, particularly declarative languages for interactive visualization. Long a staple of
database query languages such as Structured Query Language (SQL), and graphic design
languages such as Hypertext Markup Language (HTML) and Cascading Style Sheets (CSS),
declarative approaches to visualization have only more recently come to hold sway. Declarative
models for visual encoding (mapping data to visual elements) have become a dominant way of
expressing visualizations, providing the right balance of expressive power and usable,
domain-specific constructs. While prior work supports declarative approaches to static visual encodings, custom
interaction  design  is  either  unsupported  or  achieved  only  via  imperative  callbacks,  starkly 
breaking  with  a  declarative  style.  In  this  context,  we  contributed  new  models  and  corresponding 
system  implementations  that  infuse  declarative  visual  encoding  approaches  with  abstractions  for 
specifying  interaction  techniques. 
Our first result was the development of a novel declarative model for interaction techniques. By
treating  user  input  as  a  streaming  data  source,  we  combined  ideas  from  functional  reactive 
programming  and  data  stream  processing  to  model  interactive  programs  as  continuous  queries 
over  data  streams.  We  laid  out  this  model  in  papers  at  the  Association  for  Computing  Machinery 
(ACM)  User  Interface  and  Software  Technology  (UIST)  2014  symposium,  where  we  introduced 
our  declarative  model  for  interaction  and  demonstrated  its  expressivity,  and  IEEE  InfoVis  2015, 
where  we  contributed  an  implementation  of  this  model  in  the  form  of  a  reactive  dataflow 
architecture.  The  Reactive  Vega  system  supports  optimized  processing  of  streaming  data  and 
interactive  updates,  with  superior  performance  to  existing  web-based  visualization  tools  such  as 
D3. The implementation involves a number of unique aspects, including the use of
self-instantiating dataflows (a form of adaptive query plan) to handle data-dependent control flow.
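
The idea of treating user input as a stream that drives downstream values can be sketched with a very small dependency graph. The Python below is only a conceptual illustration: the Signal class, the mouse and brush names, and the data are hypothetical, and the sketch omits the optimizations and self-instantiating dataflows of the actual Reactive Vega architecture.

```python
class Signal:
    """A value that recomputes whenever one of its inputs changes."""

    def __init__(self, value=None, inputs=(), fn=None):
        self.value = value
        self._inputs = list(inputs)
        self._fn = fn
        self._dependents = []
        for s in self._inputs:
            s._dependents.append(self)
        if fn is not None:
            self._recompute()

    def _recompute(self):
        self.value = self._fn(*(s.value for s in self._inputs))
        for d in self._dependents:
            d._recompute()

    def push(self, value):
        """Feed a new event from the input stream into the graph."""
        self.value = value
        for d in self._dependents:
            d._recompute()

# Hypothetical scene: mouse x-position drives a brush, which filters data.
data = [3, 8, 15, 22, 41]
mouse_x = Signal(value=0)
brush = Signal(inputs=[mouse_x], fn=lambda x: (x, x + 20))
selected = Signal(
    inputs=[brush],
    fn=lambda b: [d for d in data if b[0] <= d <= b[1]],
)

initial = selected.value     # brush is (0, 20)
mouse_x.push(10)             # a new input event arrives
after_move = selected.value  # brush is now (10, 30)
```

No callback is written for the mouse event; the selection is declared as a function of the brush, and the graph keeps it current as events arrive.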

While  expressive  and  performant,  the  Reactive  Vega  language  can  still  require  verbose 
specifications:  while  control  flow  is  handled  by  the  model,  the  logic  of  event  handling  and  data 
processing  must  be  explicitly  provided.  As  a  result,  Reactive  Vega  is  not  well-suited  for  quickly 
generating  interactive  plots  in  the  midst  of  an  analysis  session.  Considering  this  issue  led  us  to  a 
fundamental question. Given an assignment of a data field to a visual encoding channel (e.g.,
x-position, color, size), existing declarative visual encoding systems leverage a combination of
data  type  information  (e.g.,  quantitative,  ordinal,  or  categorical)  and  smart  defaults  (e.g., 
perceptually  effective  color  palettes)  to  automatically  generate  appropriate  encoding  logic.  This 
facility  lets  users  rapidly  specify  encodings,  then  override  and  modify  default  choices  if  desired. 
We  asked:  does  an  analogous  relationship  hold  for  interaction  techniques?  For  example,  given  an 
indication  of  the  semantics  of  the  interaction,  can  one  synthesize  appropriate  event  handling 
logic? 

This  line  of  questioning  led  to  the  development  of  Vega-Lite,  a  concise  high-level  language  for 
rapidly creating interactive, multi-view visualizations. We devised a new grammar of interactive
graphics  in  which  users  specify  interactions  in  terms  of  their  semantics:  the  type  of  selection 
involved  (e.g.,  point  or  interval  selections)  and  associated  transformations.  By  default,  input 
event  handling  logic  for  populating  these  selections  is  automatically  synthesized  based  on  the 
selection  type,  applied  transforms,  and  interaction  context  (e.g.,  mouse  or  touch-based),  but 
remains  customizable.  This  selection  abstraction  then  “plugs  in”  to  visual  encodings  to  specify 
dynamic  input  data,  drive  conditional  encoding  logic,  and  define  scale  extents  (e.g.,  for  panning 
or  zooming).  Interactive  selections  in  Vega-Lite  enable  an  expressive  set  of  interaction 
techniques  in  a  surprisingly  concise  manner,  with  specifications  one  or  more  orders  of  magnitude 
more  concise  than  prior  work.  In  recognition  of  this  work,  we  received  the  Best  Paper  Award  at 
IEEE  InfoVis  2016,  a  premier  venue  for  Information  Visualization  research. 
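
To give a sense of how concise such a specification can be, the Python below builds a Vega-Lite specification as a dictionary and serializes it to JSON. It is a sketch under assumptions: the syntax follows version 5 of the language, which may differ from the form described in the original paper, and the data URL and field names are placeholders.

```python
import json

# A Vega-Lite (v5 syntax) specification built as a plain dictionary.
# The data URL and field names are placeholders for illustration.
spec = {
    "$schema": "https://vega.github.io/schema/vega-lite/v5.json",
    "data": {"url": "data/cars.json"},
    "mark": "point",
    "params": [
        # An interval selection: event handling is synthesized by default.
        {"name": "brush", "select": {"type": "interval"}}
    ],
    "encoding": {
        "x": {"field": "Horsepower", "type": "quantitative"},
        "y": {"field": "Miles_per_Gallon", "type": "quantitative"},
        # Conditional encoding driven by the selection.
        "color": {
            "condition": {
                "param": "brush",
                "field": "Origin",
                "type": "nominal",
            },
            "value": "lightgray",
        },
    },
}

spec_json = json.dumps(spec, indent=2)
```

The selection is declared once by type and then referenced by name in the color encoding; no event-handling code appears in the specification.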

In terms of impact, beyond DARPA, multiple companies, including Apple, Elastic Search, Fitbit,
MapD  and  others,  are  using  Vega  open  source  software.  Vega  has  been  adopted  as  a  standard  for 
adding  interactive  visualizations  to  Wikipedia  articles,  and  tools  based  on  Vega-Lite  have  been 
developed  in  a  variety  of  languages,  most  notably  the  Python  Altair  library  for  use  within  Jupyter 
Notebooks.  While  Vega  and  Vega-Lite  are  useful  in  their  own  right,  our  larger  vision  is  to  have 
these  languages  serve  as  a  foundation  for  novel  research.  Towards  this  end,  our  group  has  built  a 
number  of  research  systems  and  models  on  top  of  the  Vega  stack: 
Lyra (EuroVis 2014) is a graphical interface for creating custom visualization designs
without  writing  code,  built  on  Vega  and  Vega-Lite.  Lyra  is  more  expressive  than  interactive 
systems  like  Tableau,  allowing  designers  to  create  custom  visualizations  more  comparable  to 
hand-coded  designs  built  with  D3  or  Processing.  The  resulting  visualizations  are  realized  as 
Vega  specifications  that  can  then  be  published  and  reused  on  the  Web. 

Voyager (presented at IEEE InfoVis 2015 and the ACM Conference on Human Factors in
Computing Systems, CHI 2017) is a visual analysis tool that combines both manual
specification  and  automatic  chart  recommendation.  Similar  to  other  tools,  Voyager  users  can 
specify  charts  manually  by  assigning  data  fields  to  visual  encoding  channels.  However,  in 
addition,  Voyager  suggests  relevant  visualizations  based  on  a  user’s  current  focus. 
Under the hood, the CompassQL (Special Interest Group on Management of Data,
SIGMOD 2016 conference) visualization query language searches over thousands of
potential Vega-Lite charts, ranks them using both statistical and perceptual measures, and
presents the top-ranked examples in a recommendation gallery. In studies of early-stage data
analysis,  we  found  that  this  mode  of  exploration  significantly  increases  data  coverage 
compared  to  state-of-the-art  tool  designs.  By  leveraging  Vega,  any  visualization  in  Voyager 
can  be  exported  for  further  editing,  including  design  customization  in  Lyra. 

GraphScape  (ACM  CHI  2017  Best  Paper  Nominee)  provides  a  formal  model  of  the 
relationships  among  Vega-Lite  visualizations.  GraphScape  is  a  directed  graph  model,  where 
nodes  represent  Vega-Lite  charts  and  edges  represent  specification  edits  that  turn  one  chart 
into  another.  GraphScape  edges  are  weighted  by  an  estimated  “cost”  of  the  effort  needed  to 
interpret  one  chart  having  seen  another.  This  model  enables  applications  such  as  automatic 
sequence generation for presentations and automatic search for design alternatives. For
example,  given  an  initial  visualization  design  (such  as  a  scatter  plot),  GraphScape  can  be 
used  to  find  similar  designs  that  better  scale  to  large  data  volumes  (e.g.,  binned  aggregation 
or sampling). In controlled studies, we have found that GraphScape sequence
recommendations closely align with human judgments.
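
The GraphScape model described above can be sketched as a shortest-path search over a weighted directed graph. The chart names and edit costs below are invented for illustration and are not GraphScape's actual cost estimates; the search is ordinary Dijkstra, used here as a stand-in for whatever search an application would apply.

```python
import heapq

# Hypothetical chart states (nodes) and edit costs (edge weights).
EDITS = {
    "scatter": {"binned_scatter": 2.0, "sampled_scatter": 1.5},
    "sampled_scatter": {"binned_scatter": 1.0},
    "binned_scatter": {"heatmap": 0.5},
    "heatmap": {},
}

def cheapest_sequence(graph, start, goal):
    """Return (total_cost, [charts]) for the lowest-cost chain of edits."""
    queue = [(0.0, start, [start])]
    settled = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == goal:
            return cost, path
        if node in settled:
            continue
        settled.add(node)
        for nxt, weight in graph[node].items():
            if nxt not in settled:
                heapq.heappush(queue, (cost + weight, nxt, path + [nxt]))
    return float("inf"), []  # goal not reachable from start

cost, path = cheapest_sequence(EDITS, "scatter", "heatmap")
```

With these made-up weights, the direct edit to a binned scatter plot is preferred over the detour through sampling, which is the kind of ordering a sequence recommender would surface.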


Building  DSLs  for  Data  Analysis 

The  Stanford  team  worked  on  three  main  projects  during  the  period  of  this  grant.  The  first, 
Riposte,  was  a  vector  virtual  machine  for  the  R  programming  language.  The  second,  Terra,  was 
infrastructure for building high performance domain-specific languages. The third was Opt, a
domain-specific language (DSL) for computing solutions to non-linear least squares problems.
All these projects resulted in open source releases and a body of dedicated users; all the
systems were also documented with publications in premier conferences.

Riposte 

Riposte is a virtual machine for array processing operations embedded in the R programming
language.  Vector  virtual  machines  work  well  for  long  vectors.  One  of  the  most  innovative 
features  of  the  system  is  the  optimizations  for  scalars  and  short  vectors,  which  we  term  partial 
length  specialization. 

Another main focus of our research has been on adapting the code to the nature of the data. For
example, if the data is grouped and aggregated, the method used depends on the number of
groups and the number of elements in each group.

A  significant  amount  of  effort  was  also  put  into  investigating  new  methods  for  building  vector 
processing  virtual  machines.  R  implements  primitive  functions  in  C  code  as  part  of  the 
interpreter  itself.  Riposte  takes  a  more  principled  approach,  implementing  these  functions  in  R 
code in a new standard library. Where necessary, Riposte defines some functions using C code
through  a  new  foreign  function  interface  which  permits  specifying  just  the  kernels  of  map,  fold, 
and  reduce-style  operations  in  C.  The  Riposte  interpreter  handles  the  iteration  over  these  kernels, 
allowing  us  to  perform  vector  fusion  over  library-defined  C  code. 
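
The division of labor between kernels and the interpreter can be sketched as follows. This is a Python analogy, not Riposte code: the function name and the use of Python callables in place of C kernels are assumptions made for illustration.

```python
def fused_map_fold(map_kernels, fold_kernel, init, data):
    """Run a chain of scalar map kernels followed by a fold in one pass.

    The caller supplies only the per-element kernels; this driver owns the
    iteration, so no intermediate vector is allocated between the steps.
    """
    acc = init
    for x in data:
        for kernel in map_kernels:
            x = kernel(x)
        acc = fold_kernel(acc, x)
    return acc

# Equivalent to sum(v * 2 + 1 for v in data), but expressed as kernels.
data = [1.0, 2.0, 3.0, 4.0]
kernels = [lambda x: x * 2.0, lambda x: x + 1.0]
total = fused_map_fold(kernels, lambda a, b: a + b, 0.0, data)
```

An unfused evaluation would build one temporary vector for the multiplication and another for the addition before summing; keeping the loop in the driver is what makes the fusion possible.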

These  new  algorithms  and  approaches  to  building  the  system  led  to  substantially  faster 
implementations  of  many  algorithms  in  R.  We  had  hoped  to  integrate  this  work  into  the  core  R 
release,  but  the  project  was  left  partially  unfinished  because  the  team  working  on  the  project  left 
for  industry. 

Terra 

As part of this proposal, we developed a new programming language, Terra. Terra is a low-level,
C-like language embedded in Lua. Lua is dynamically typed and has garbage collection. Terra is
statically  typed  and  memory  management  is  explicit.  The  Terra  sublanguage  is  based  on  LLVM 
(originally  Low  Level  Virtual  Machine,  now  an  umbrella  for  compilation  and  machine  code 
generation).  We  implement  DSLs  in  Terra  using  meta-programming.  Lua  parses  and  analyzes  the 
DSL, and then translates it into Terra. The Terra code is then JIT-compiled (just-in-time
compilation) into very efficient low-level code.
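
The two-stage pattern can be illustrated with a loose Python analogy, in which a host program generates specialized source text and then compiles it. This is not Terra: Python's exec produces interpreted bytecode rather than native code through LLVM, and the function names are hypothetical.

```python
def compile_polynomial(coeffs):
    """Generate and compile a function specialized to fixed coefficients.

    Stage 1 (the host-language role): build source text in Horner form with
    the constants baked in. Stage 2 (the compiled-language role): compile it.
    Coefficients are given from the highest degree down.
    """
    body = repr(float(coeffs[0]))
    for c in coeffs[1:]:
        body = f"({body}) * x + {float(c)!r}"
    source = f"def poly(x):\n    return {body}\n"
    namespace = {}
    exec(compile(source, "<generated>", "exec"), namespace)
    return namespace["poly"], source

# 2x^2 + 0x + 1, specialized at generation time.
poly, source = compile_polynomial([2, 0, 1])
value = poly(3.0)
```

The generated function contains no loop over coefficients and no list lookups; all of that work happened in the generating stage, which is the benefit staging is meant to provide.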

One  novel  part  of  Terra  is  the  concept  of  Exotypes,  a  type  extension  system.  Exotypes  create 
types  that  are  as  fast  as  native  implementations  in  statically  typed  languages  such  as  C,  but  are 
embedded  in  dynamically  typed  languages  using  a  meta-object  protocol.  Using  Exotypes,  we 
have implemented Array(T), a type constructor that builds a high-performance array
implementation where each element of the array is of type T. The Exotype compiler creates an
efficient implementation using a combination of blocking and vectorization, the same techniques
that ATLAS (Automatically Tuned Linear Algebra Software) and other array processing
meta-compilers use to achieve efficiency. Exotypes allow us to build systems like Riposte much
more efficiently.

We use terralang.org for the documentation and as the portal into the system. Terra itself is
maintained on GitHub.

Opt 

The  final  major  project  was  a  DSL  called  Opt  designed  to  solve  non-linear  least  squares 
optimization  problems  over  large  regular  arrays  or  graph  data  structures.  Users  enter  the  function 
to  be  optimized  as  a  simple  expression  representing  an  energy  to  be  minimized.  This  expression 
has  knowns  and  unknowns.  The  unknown  variables  are  associated  with  different  elements  of  the 
array  or  graph.  Opt  is  implemented  in  Terra. 

Opt automatically compiles an efficient solver for the given energy function. Our compiler
automatically  transforms  these  specifications  into  state-of-the-art  GPU  solvers  based  on  Gauss- 
Newton  or  Levenberg-Marquardt  methods.  The  compiler  has  several  unique  components.  (1)  We 
have  designed  a  new  method  for  optimizing  symbolic  tensor  expressions  based  on  Einstein 
summation  notation.  We  have  implemented  a  program  analyzer  for  these  expressions,  and 
generate  optimized  code  for  different  types  of  sparse  matrices.  (2)  We  have  developed  a  method 
for  auto-differentiation  of  symbolic  tensor  expressions. 

We  did  an  extensive  evaluation  of  the  system.  This  included  comparing  our  approach  to  existing 
DSLs  such  as  Ebb  (developed  at  Stanford  University)  and  Simit  (developed  at  Massachusetts 
Institute  of  Technology).  Opt  solves  for  unknowns  100-600  times  faster  than  previous 
approaches.  The  main  reason  Opt  is  that  much  faster  is  that  it  compiles  a  specialized  solver  that 
does  not  materialize  a  matrix  (we  call  this  a  matrix-free  solver). 
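
The matrix-free idea can be sketched in plain Python for a small one-dimensional smoothing energy, E(x) = sum_i (x_i - d_i)^2 + w * sum_i (x_{i+1} - x_i)^2. The energy, the weight w, and all function names are assumptions chosen for illustration; Opt generates GPU solvers, which this CPU sketch does not attempt to reproduce. The Jacobian J is never stored: the solver only needs the products J*v and J^T*u.

```python
import math

def residuals(x, d, w):
    """Stacked residual vector: data terms, then weighted smoothness terms."""
    s = math.sqrt(w)
    return [xi - di for xi, di in zip(x, d)] + \
           [s * (x[i + 1] - x[i]) for i in range(len(x) - 1)]

def jvp(v, w):
    """J * v, computed directly from the residual structure."""
    s = math.sqrt(w)
    return list(v) + [s * (v[i + 1] - v[i]) for i in range(len(v) - 1)]

def jtvp(u, n, w):
    """J^T * u, again without forming J."""
    s = math.sqrt(w)
    out = list(u[:n])
    for i in range(n - 1):
        out[i] -= s * u[n + i]
        out[i + 1] += s * u[n + i]
    return out

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def gauss_newton_step(x, d, w, cg_iters=50, tol=1e-12):
    """One Gauss-Newton step: solve (J^T J) delta = -J^T r by conjugate gradient."""
    n = len(x)
    b = [-g for g in jtvp(residuals(x, d, w), n, w)]
    delta = [0.0] * n
    r = list(b)          # residual of the linear system
    p = list(r)          # search direction
    rs = dot(r, r)
    for _ in range(cg_iters):
        if rs < tol:
            break
        ap = jtvp(jvp(p, w), n, w)   # (J^T J) p via two matrix-free products
        alpha = rs / dot(p, ap)
        delta = [di + alpha * pi for di, pi in zip(delta, p)]
        r = [ri - alpha * api for ri, api in zip(r, ap)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return [xi + di for xi, di in zip(x, delta)]

d = [0.0, 1.0, 0.0, 1.0, 0.0]
x = gauss_newton_step([0.0] * len(d), d, w=1.0)
gradient = jtvp(residuals(x, d, 1.0), len(d), 1.0)
```

Because this particular energy is linear in x, a single Gauss-Newton step reaches the minimizer; a genuinely non-linear energy would repeat the step, recomputing the products at each new x.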

Opt  is  available  as  open  source  and  we  have  had  roughly  100  downloads. 

RESULTS AND DISCUSSION


WHAT  DO  WE  GET  FROM  D&R?  DEEP  ANALYSIS 

We  get  deep  analysis,  as  described  above,  even  when  the  data  are  big  and  computational 
complexity  is  high.  This  includes  visualization  of  the  detailed  data,  critical  to  statistical  model 
building  and  validation,  and  to  determining  if  a  machine  learning  method  is  appropriate  for  the 
visualized  patterns  in  the  data.  We  do  visualization  by  applying  a  method  to  subsets  that  have 
the detailed data. While it is typically feasible to apply a visualization method to all subsets, it is
often not practical to look at them all because there can be far too many subsets, which in
applications can number from tens of thousands to millions. So we sample. Sampling plans that
preserve  the  information  in  the  data  can  be  readily  devised  because  we  can  compute  sampling 
variables  across  all  subsets. 
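
One simple sampling plan of this kind can be sketched as follows: compute a sampling variable for every subset, rank the subsets by it, and pick subsets evenly spaced across the ranks. This is plain Python rather than datadr code, and the choice of the subset mean as the sampling variable is an assumption made for illustration.

```python
from statistics import mean

def sample_subsets_by_quantile(subsets, sampling_variable, k):
    """Return k subset keys spread evenly across the ranked sampling variable."""
    ranked = sorted(subsets, key=lambda key: sampling_variable(subsets[key]))
    if k >= len(ranked):
        return ranked
    if k == 1:
        return [ranked[len(ranked) // 2]]  # the median subset
    step = (len(ranked) - 1) / (k - 1)
    return [ranked[round(i * step)] for i in range(k)]

# Nine hypothetical subsets keyed by site, each holding two measurements.
subsets = {f"site{i}": [float(i), float(i) + 1.0] for i in range(9)}
picked = sample_subsets_by_quantile(subsets, mean, 3)
```

The sampling variable is cheap to compute on every subset, so the expensive step, looking at plots, is spent only on subsets that span the range of that variable.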


WHAT  DO  WE  GET  FROM  D&R?  HIGH  COMPUTATIONAL  PERFORMANCE 

We get high computational performance. DeltaRho can dramatically increase the data size and
analytic computational complexity that are feasible in practice, whether the hardware power of an
available cluster is small, medium, or large. The data can have a memory size that is larger than
the  physical  cluster  memory.  For  us  this  occurs  routinely. 


WHAT  DO  WE  GET  FROM  D&R?  ACCESS  TO  METHODS 

We get access to the thousands of methods of statistics, machine learning, and data visualization.


WHAT  DO  WE  GET  FROM  D&R?  HIGH  EFFICIENCY  PROGRAMMING  BY  THE 
ANALYST 

We get very high efficiency in using R and datadr to program with the data, along with great
power and flexibility that allow deep analysis and tailoring of analyses to the data. Most
importantly, DeltaRho protects the analyst from the details of distributed parallel computation
and subset database management. Furthermore, datadr is abstracted from back end choices, so
that its code is the same whatever the back end. For example, you can use datadr on a single
multicore machine. Of course, back ends other than Hadoop require other software that connects
datadr to the back end, as RHIPE does for Hadoop.


WHAT  DO  WE  GET  FROM  D&R?  THE  PRICE  IS  EXCELLENT 

It's free. The software is all open source. See www.deltarho.org for information on
downloading and installing the software, and for documentation on programming with datadr.

CONCLUSIONS


This  project  involved  the  development  and  evaluation  of  a  new  framework  for  data  analysis, 
Divide  and  Recombine  (D&R).  In  D&R,  the  analyst  divides  the  data  into  subsets,  applies  an 
analytic  method  to  each  subset,  and  then  combines  the  output  of  the  analysis  into  a  final  result. 
The  advantage  of  D&R  is  that  it  is  easy  to  parallelize,  and  hence  scalable  to  large  datasets. 
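
The three steps can be sketched in a few lines of Python. This is an illustration of the pattern only, not the datadr interface: the function names, the per-subset method (a least-squares slope), and the recombination by simple averaging are all assumptions.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from statistics import mean

def divide(records, key):
    """Divide: group records into subsets by a key function."""
    subsets = defaultdict(list)
    for rec in records:
        subsets[key(rec)].append(rec)
    return dict(subsets)

def apply_method(subset):
    """Apply: least-squares slope of y on x within one subset."""
    xs = [r["x"] for r in subset]
    ys = [r["y"] for r in subset]
    xbar, ybar = mean(xs), mean(ys)
    sxx = sum((x - xbar) ** 2 for x in xs)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    return sxy / sxx

def recombine(outputs):
    """Recombine: average the per-subset estimates."""
    return mean(outputs)

# Synthetic data: three groups with a common slope of 2 and different offsets.
records = [{"g": g, "x": float(x), "y": 2.0 * x + g}
           for g in range(3) for x in range(4)]
subsets = divide(records, key=lambda r: r["g"])
with ThreadPoolExecutor() as pool:
    # Each subset is independent, so the apply step parallelizes trivially.
    slopes = list(pool.map(apply_method, subsets.values()))
estimate = recombine(slopes)
```

No subset needs to see any other subset during the apply step, which is why the computation distributes so easily across cores or cluster nodes.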

D&R  poses  interesting  theoretical  problems  in  statistics,  and  this  research  project  has  developed 
ways to optimize the divide and recombine steps. Furthermore, an extensive suite of open source
tools has been developed and released to the community. Finally, D&R has been used
extensively in the XDATA project to perform data analysis of important datasets, and has been
shown to be a valuable tool in the data scientist's tool chest.

The project also had two noteworthy but smaller projects. The first involved new tools and
methods for scalable visualization, and the second involved methods for developing
domain-specific languages for array processing and optimization. Again, all this software has been
released  to  the  community  as  open  source. 